Abstract

This paper reports on the work conducted by Fraunhofer SCAI for the
TREC Chemistry (TREC-CHEM) track 2009. The Fraunhofer SCAI team
participated in two tasks, namely Technology Survey and Prior Art
Search. The core of the framework is an index of 1.2 million chemical
patents provided as a data set by TREC. For the technology survey,
three runs were submitted based on semantic dictionaries and noun
phrases. For the prior art search task, several fields were introduced
into the index that contained normalized noun phrases as well as
biomedical and chemical entities. Altogether, 36 runs were submitted
for this task; they were based on automatic querying with tokens, noun
phrases and entities along with different search strategies.

1  Introduction 

Text processing in chemistry is more challenging than in other fields
because chemical names can be represented in many different ways, such
as trivial names, IUPAC names, brand names, InChI and SMILES [7]. For
example, the drug name “Aspirin” is reported to have 25 synonyms and
95 brand names in DrugBank. To address this challenge, TREC provides a
workbench for large-scale evaluation and comparison of different
techniques for text retrieval in chemistry. TREC-CHEM addresses this
challenge in terms of two independent tasks. The first task, the
technology survey, consists of 18 expert-defined natural language
expressions of an information need; the goal is to retrieve a set of
documents from a predefined collection that best answer the questions.
The second task, prior art search, consists of 1000 test patents; the
goal is to retrieve sets of documents that invalidate each test patent.

Considering the ambiguity inherent in the chemical literature, our
approach focused on tagging the chemical and biomedical named entities
in the documents. Tagging the entities and mapping them to standard
database entries normalizes different forms of the same entity to one
standard form. This helps to overcome the problems associated with
multiple synonyms, acronyms and morphological variants in text.
Moreover, document retrieval based on semantically tagged entities has
demonstrated variable success in the past. A precondition for such an
approach is the availability of comprehensive and domain-specific
dictionaries as well as named entity recognition techniques. Since the
entities in the chemistry patent space are not as well explored as
those in the biomedical space, we propose to tag the noun phrases and
normalize them to their base form before further assessments. From the
querying and retrieval point of view, the performance of retrieval
using tokens, noun phrases, entities and their combinations has been
evaluated.

Sections 2 and 3 describe the workflow used for the technology survey
and the prior art search task, respectively. Section 4 provides the
experimental results of both tasks and Section 5 provides the
concluding remarks.


2  Technology  Survey  Task 

The data provided for the Technology Survey (TS) task contain
approximately 1.2 million patents from the European and US patent
offices, 51,000 full-text articles from the Royal Society of Chemistry
(RSC) and 18 topics that are formulated by human experts as natural
language narratives. The task is to retrieve a set of documents from
the corpus that best answer the question. An example of such a topic is
“TS-7: Please identify documents with formulations of minitabs,
containing a Factor Xa inhibitor”.

2.1  Data  Preprocessing 

The TREC corpus was provided in Extensible Markup Language (XML). As a
preliminary measure, the different sections/zones within the patents
and RSC articles were analyzed. Patent documents contain several fields
that are presumably not necessary for retrieval and generate
substantial noise while processing the documents. Examples of such
fields are country, bibliographic data, legal status, or non-English
abstracts. Similar examples within RSC articles are number of pages,
citations, or editor. The aim was to use only those fields that have a
high text/noise ratio and rich information content. Therefore, from a
retrieval point of view, the following fields were chosen for indexing
and further assessments:


Patents: UCID, Publication date, Authors, IPC class, Title, Abstract,
Description and Claims.

RSC articles: DOI, Publication date, Authors, Article body (front) and
Article body (back).

2.2  Indexing 


Figure 1: An overview of the workflow implemented for the technology
survey and the prior art search tasks. Legend: one arrow type marks
information flow from the TREC corpus, the other marks information flow
from the test topic/patent. For the technology survey, entity classes
occurring within an expert-defined topic (i.e. test topic) are used for
querying, whereas for the prior art search task, entity classes
occurring within the test patent are used for querying.

Based on preprocessing, the documents were indexed using SCAIView [3].
Figure 1 provides an overview of the methodology used for indexing.
SCAIView is a high-performance and scalable Information Retrieval (IR)
system based on Lucene [3]. It provides a framework for indexing
several gigabytes of document data and for quickly performing complex
searches over text as well as named entities. The scoring algorithm
considers the frequency of a particular term within individual
documents and the frequency of the term across all documents. Only the
fields mentioned in the previous section were used for indexing; the
remaining information was disregarded for both patents and RSC
articles.
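The scoring principle described above (term frequency within a document combined with term frequency across the collection) can be illustrated with a minimal TF-IDF sketch. This is not SCAIView's or Lucene's actual implementation: Lucene's similarity also applies factors such as length normalization, and the function name, the whitespace tokenization and the exact weighting below are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query with a plain TF-IDF sum.

    `documents` maps a document id to its text. Tokenization is a simple
    lowercase whitespace split (an assumption, not the paper's tokenizer).
    """
    n_docs = len(documents)
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if term_freq[term] == 0:
                continue
            # Rare terms get a higher weight than terms found everywhere.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            # Dampened term frequency within the document.
            score += math.sqrt(term_freq[term]) * idf
        scores[doc_id] = score
    return scores
```

A document that mentions a query term more often scores higher, while terms that appear in most documents contribute less to the ranking.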

2.3  Querying  and  Retrieval 

Given the time constraints, named entities and noun phrases were not
incorporated into the index for this task. Nevertheless, the impact of
available semantic dictionaries was tested. The dictionaries used for
this task are described in Section 3.2. Three runs were submitted for
the technology survey. For the first two runs, 18 independent
topic-specific dictionaries were generated, whereas for the third run
18 independent noun phrase-based queries were used. In the first run,
precompiled dictionaries were used for named entity recognition and the
results were ordered by hit frequency. In the second run, the
dictionary entries were queried using SCAIView and the results were
ranked by Lucene’s similarity score. For the final run, noun phrases
were used for querying and the results were again ranked by Lucene’s
similarity score.

2.3.1  Run1:  SCAI09TSPM

For this run, the basic idea was to apply named entity recognition and
automatically build 18 independent, topic-specific dictionaries from
the entities found. If named entities were automatically recognized
within a TS topic, they were used directly along with their synonyms as
present in the dictionary. As anticipated, however, the dictionaries
were not comprehensive enough to cover the entities present in all of
the provided topics. We found no hits in a number of topics, for
example in “Synthetic routes used to perform Diels-Alder reaction on a
multi-gram scale”. If no hits were detected, the entities were
identified manually and expanded with their synonyms. Examples of
automatically found and manually generated entries are given in
Table 1.

The dictionaries generated for the TS task were used within the
ProMiner framework [3] to identify potential term mentions within the
corpus of patents as well as RSC articles. The aim was to identify
those documents that contain terms present in the dictionary; the
documents were then ranked by simple term frequencies.
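As a rough illustration of ranking by simple term frequencies, the sketch below counts case-insensitive dictionary hits per document and sorts by that count. It is a stand-in written for this explanation and does not reproduce ProMiner's matching; the function name and the regular-expression matching are assumptions.

```python
import re
from collections import Counter


def rank_by_dictionary_hits(documents, dictionary_terms):
    """Rank documents by the number of dictionary term mentions they contain.

    `documents` maps a document id to its text; `dictionary_terms` is a list
    of names and synonyms. Matching is whole-word and case-insensitive.
    """
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in dictionary_terms
    ]
    hits = Counter()
    for doc_id, text in documents.items():
        count = sum(len(pattern.findall(text)) for pattern in patterns)
        if count > 0:
            hits[doc_id] = count
    # Highest hit count first; documents without hits are not returned.
    return hits.most_common()
```

Note that overlapping entries such as "betaine" and "glycine betaine" are each counted in this sketch, so a mention of the longer synonym contributes two hits.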

According to the definition of the TREC task, patent identifiers had
to be submitted without ‘patent-type’ information. Therefore, of all
revisions of a patent, only the one with the maximum score was
reported.
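The collapsing of patent revisions can be sketched as follows. The identifier layout is an assumption: the example treats everything after the last hyphen as the patent-type suffix (for instance a hypothetical "EP-0327505-A1"), which may not match the exact format of the TREC-CHEM data.

```python
def best_score_per_patent(scored_revisions):
    """Keep the maximum score among all revisions of each patent.

    `scored_revisions` is a list of (revision_id, score) pairs. The part of
    the id after the last hyphen is assumed to be the patent-type suffix.
    """
    best = {}
    for revision_id, score in scored_revisions:
        patent_id = revision_id.rsplit("-", 1)[0]  # drop the assumed suffix
        if patent_id not in best or score > best[patent_id]:
            best[patent_id] = score
    # Return patents ordered by their best score, highest first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```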

2.3.2  Run2:  SCAI09TSMAN 

This semi-automatic process is intended to give a baseline for the
retrieval performance of non-patent experts. For this run, the same
dictionaries generated for SCAI09TSPM were used. The queries were
performed using the SCAIView search engine and the documents were
retrieved and ranked by Lucene’s similarity score. As described in the
previous section, the patent-type information was removed from the
retrieved patent identifiers. Given the limited set of TS topics, we
did not index the chemical or biomedical terms but rather expanded the
queries manually with their corresponding synonyms. An example query
for TS-15 is:

("Betaine" OR "Glycine betaine" OR "Glycocol betaine" OR "Glycylbetaine" OR ...)
AND ("Peripheral Artery disorder" OR "Peripheral Arterial Disease" OR ...)
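A query of this shape can be assembled mechanically from synonym groups: synonyms of one concept are joined with OR, and the concepts are joined with AND. The helper below is a minimal sketch under that assumption; its name and the quoting rules are illustrative and not taken from SCAIView.

```python
def build_boolean_query(synonym_groups):
    """Build a Boolean phrase query from groups of synonyms.

    Each inner list holds the synonyms of one concept. Synonyms are OR-ed
    inside a group and the groups are AND-ed together.
    """
    clauses = []
    for group in synonym_groups:
        # Quote every synonym so multi-word names are matched as phrases.
        quoted = ['"{}"'.format(term.replace('"', '\\"')) for term in group]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)
```

For the TS-15 example, one group would hold the betaine synonyms and the other the disease synonyms.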

2.3.3  Run3:  SCAI09TSNP 

The principle behind the SCAI09TSNP run was to perform the task
automatically, based on noun phrase detection with the OpenNLP chunker.
Since noun phrases carry substantial information in the form of head
nouns and their modifiers, the idea is to use noun phrases as queries.
Noun phrases from each topic description were collected separately and
used directly for querying. For this run, the queries were neither
expanded with synonyms nor normalized to unique base forms. An example
query for TS-15 generated with the NP chunker is:

("cardiovascular" AND "betaines" AND "peripheral arterial disease")

Informative Term            Synonyms                                                      Source
Betaine                     Glycine betaine, Glycocol betaine, Glycylbetaine etc.         ATC
Peripheral Artery Disease   Peripheral Artery Disorder, Peripheral Arterial Disease etc.  MeSH
Diels-Alder reaction        Diels Alder reaction, Diels Alder mechanism etc.              Manual

Table 1: Synonyms and their sources for informative terms within TS topics

3  Prior  Art  Search  Task 

The data provided for the Prior Art (PA) search task contain
approximately 1.2 million patents from the European and US patent
offices and 1000 test/query patents. The task is to retrieve sets of
documents from the corpus that invalidate each query document. An
example of such a task is “PA-1: Find all patents in the collection
that would potentially be able to invalidate patent EP-0327505”.

3.1  Data  Preprocessing 

The same preprocessing as for the TS task, such as the selection of
informative fields and the extraction of plain text, was applied for
the PA task. An analysis of IPC classes was conducted for the large
patent corpus as well as the query patent subcorpus. The results
indicate that the superclasses A61 “Medical or Veterinary Science” and
C07 “Organic Chemistry” dominate the corpus with more than 70% of the
total patents provided.

OpenNLP: http://opennlp.sourceforge.net/projects.html, last accessed October 2009


3.2  Named  Entity  Recognition 


The analysis of IPC classes mentioned in Section 3.1 has shown that
organic chemistry, biomedicine and biochemistry occupy a large part of
the corpus.
The  hypothesis  is  that  named  entity  recognition  of 
chemicals  and  biomedical  terms  helps  to  overcome 
the  problems  associated  with  synonyms  by  automatic 
query  expansion.  ProMiner  was  used  for  the  task  of 
entity  recognition  with  different  dictionaries.  Named 
Entity  Recognition  was  performed  independently  on 
Title,  Abstract,  Claims  and  Description  and  indexed 
separately.  The  following  entity  classes  have  been 
used: 


Chemical Names: Chemical names including synonyms, formulae, IUPAC
names, and brand names of chemical compounds as extracted from
DrugBank, KEGG Drugs and KEGG Chemicals. Additionally, IUPAC-like
names as detected with a machine learning based system are
incorporated. It performs an internal normalization to map different
variants to one base form.

Genes/Proteins: Human gene and protein names as well as their synonyms,
extracted from Entrez Gene and UniProt [2].

Diseases: Disease names and their synonyms, extracted from the Medical
Subject Headings (MeSH).

Pharma Terms: Pharmacological terms extracted from the Anatomical
Therapeutic Chemical (ATC) drug classification system. Since ATC does
not contain synonyms and term variants, this information was gathered
from UMLS with the help of the MetaMap program [5].

KEGG: http://www.genome.jp/kegg/, last accessed October 2009
Entrez Gene: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene, last accessed October 2009
UniProt: http://www.uniprot.org/, last accessed October 2009

3.3  Noun  Phrase  Recognition  with 
NP  Chunker 

As described in Section 2.3.3, noun phrases are a good source of
information content in text. Therefore, the OpenNLP-based NP chunker
was applied to recognize all noun phrases that occur in the query
patent corpus, which resulted in 549,921 phrases. Of the extracted noun
phrases, 1000 were randomly selected and manually classified as
informative or not. Table 2 shows some example noun phrases for both
classes. Since only 52% of the extracted noun phrases were found to be
informative, a rule-based filtering step was incorporated to remove the
non-informative noun phrases. After filtering, 194,322 noun phrases
remained, of which 70% were informative. In a last step, the noun
phrases were normalized using the Norm program provided within the
Specialist NLP package by the National Library of Medicine (NLM). Norm
creates an abstract representation of text strings, ignoring alphabetic
case, inflection, spelling variants, punctuation, genitive markers,
stop words, diacritics, symbols, ligatures, and word order. After
normalization, the noun phrases with similar base forms were mapped
onto each other to generate a noun phrase dictionary, which was then
used within ProMiner to recognize potentially useful noun phrases
occurring in the patent corpus.

MeSH: http://www.nlm.nih.gov/mesh/, last accessed October 2009
ATC (KEGG BRITE): http://www.genome.jp/kegg/brite.html, last accessed October 2009
Norm: http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html, last accessed October 2009
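The normalization and grouping of noun phrases described in Section 3.3 can be approximated with a much simplified sketch. It is not the NLM Norm program: it covers only case, diacritics, ligatures, punctuation, a basic genitive marker, a small hypothetical stop word list and word order, and it leaves out inflection and spelling variants. All names below are illustrative.

```python
import re
import unicodedata
from collections import defaultdict

# Small illustrative stop word list (an assumption, not Norm's list).
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "on", "with"}


def normalize_phrase(phrase):
    """Return a simplified base form of a noun phrase."""
    # Decompose characters so diacritics and ligatures can be dropped or split.
    text = unicodedata.normalize("NFKD", phrase)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"'s\b", "", text)          # strip a simple genitive marker
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and symbols
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(sorted(words))            # ignore word order


def build_phrase_dictionary(phrases):
    """Group surface forms that share the same normalized base form."""
    groups = defaultdict(set)
    for phrase in phrases:
        key = normalize_phrase(phrase)
        if key:
            groups[key].add(phrase)
    return dict(groups)
```

With this sketch, "tyrosine kinase inhibitor" and "Inhibitor of Tyrosine-Kinase" end up under the same dictionary entry.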


3.4  Indexing 

Following data preprocessing and named entity recognition, the document
texts as well as the biomedical and chemical entities occurring within
them were indexed with SCAIView. Figure 1 shows an overview of the
workflow implemented for the PA task.

Entity recognition with the different dictionaries was performed
separately on the title, abstract, description and claims sections of
each document. Additionally, the entities that occur within different
sections were indexed as separate fields. Unlike a conventional index
that contains only tokens, the index used here additionally contains
noun phrases, chemicals and biomedical entities. Finally, the index had
34 fields: ID, Authors, Title, Publication date, IPC class, Abstract,
Claims, Description, Chemical names (occurring in Title, Abstract,
Claims and Description), IUPAC-like (occurring in Title, Abstract,
Claims and Description), Genes/Proteins (occurring in Title, Abstract,
Claims and Description), Pharma terms (occurring in Title, Abstract,
Claims and Description), Diseases (occurring in Title, Abstract, Claims
and Description), and Noun Phrases (occurring in Title, Abstract,
Claims and Description). Table 3 shows the frequency of the different
entities occurring in the entire corpus as well as the number of
documents that contain at least one entity of interest. Chemical and
biomedical entities that are present in our dictionaries occur within a
large portion of the patent corpus. Noun phrases do not occur in all
1.2 million patents because only the noun phrases that occur within the
query corpus were tagged; the remaining noun phrases were excluded.

3.5  Querying  and  Retrieval 

Altogether, 36 runs were submitted for the prior art search task. The
queries were performed using different combinations of entity types
occurring in the query patent and different search strategies. The
results were filtered based on priority dates, which means that the
priority date of a retrieved patent had to be earlier than the priority
date of the query patent. In order to understand the importance of
filtering, both filtered and unfiltered results were submitted. Table 5
shows all the submitted runs along with run identifiers, entity types
and sections used for querying, and an indication of whether the
results were filtered or not.

Informative Noun Phrases    Non-informative Noun Phrases
copper strip test           1 2 3 1 2 m 4 R=H
methoxypropynyl group       1200 W 13.56 MHz RF power
biodegradable collagen      about 1800 mg/kg
self-adhesive CODAL tape    A)1>[M M]/(4 [M M] [M M])
tyrosine kinase inhibitor   such difficulties

Table 2: Examples of extracted noun phrases.

                  No. of unique entities      No. of documents with one or more entities
Entity Class      Large Corpus  Query Corpus  Large Corpus  Query Corpus
Chemical Names    14,342        2,661         933,468       869
IUPAC-like        8,504,912     17,972        817,606       629
Pharma Terms      449           193           892,736       431
Genes/Proteins    4,246         639           548,428       246
Diseases          18,458        425           824,415       196
Noun Phrases      182,388       190,528       1,176,217     1000

Table 3: Frequencies of dictionary entries occurring within the large corpus as well as the query corpus
and numbers of documents containing at least one entity of interest.
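The priority-date filter used for the filtered runs amounts to a single comparison per retrieved patent. The sketch below assumes that priority dates are available as a mapping from patent id to `datetime.date`; the function name, the data layout and the ids in any usage are hypothetical.

```python
from datetime import date


def filter_by_priority_date(results, priority_dates, query_patent_id):
    """Keep only retrieved patents whose priority date precedes the query's.

    `results` is a ranked list of (patent_id, score) pairs and
    `priority_dates` maps patent ids to `datetime.date` objects. Patents
    with an unknown priority date are dropped in this sketch.
    """
    cutoff = priority_dates[query_patent_id]
    return [
        (patent_id, score)
        for patent_id, score in results
        if patent_id in priority_dates and priority_dates[patent_id] < cutoff
    ]
```

The order of the input ranking is preserved, so the filter can be applied after retrieval without rescoring.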

The different objects that were used for querying are:

Tokens: Search with all tokens that occur in a query patent.

Noun Phrases: Search with all noun phrases that occur in a query
patent.

Entities: Search with all chemical entities (chemical names and
IUPAC-like) and biomedical entities (pharma terms, genes/proteins and
diseases) that occur in a query patent.

The different search strategies are:

Full Document: Search in title, abstract, claims and description.

Weighted Zones: Search different sections of the document with
different boosting factors. The boosting factors were set to 3 for the
abstract, 2 for the claims, 1.5 for the description and 2 for the
title.

Description Only: Search within the description section only.

Claims Only: Search within the claims section only.

Full Document and IPC Class: Search in title, abstract, claims and
description, and give high priority to retrieved documents that have
the same IPC class as the query document.
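One way to express the weighted-zones strategy is a fielded query in the Lucene classic query parser syntax, where `^` attaches a boost to a clause. The sketch below only builds such a query string with the boosting factors listed above; the field names (`title`, `abstract`, `claims`, `description`) and the helper itself are assumptions, and the IPC class priority is not covered because the text gives no weight for it.

```python
# Boosting factors from the text; the field names are assumed.
ZONE_BOOSTS = {"title": 2.0, "abstract": 3.0, "claims": 2.0, "description": 1.5}


def build_weighted_zone_query(terms, zone_boosts=ZONE_BOOSTS):
    """Build a query string that searches each zone with its own boost."""
    # Quote every term so multi-word phrases stay together.
    quoted = " OR ".join('"{}"'.format(term) for term in terms)
    clauses = [
        "{}:({})^{}".format(zone, quoted, boost)
        for zone, boost in zone_boosts.items()
    ]
    return " OR ".join(clauses)
```

A match in the abstract then contributes more to the document score than the same match in the description.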

In Table 5, boosting indicates a run with a high boosting factor of 3
assigned to all chemical entities, to all noun phrases, or to noun
phrases that co-occur within the same sentences of the query document,
respectively. The assumption behind the latter was that co-occurring
noun phrases describe the context of the document and would therefore
serve as a good source for information retrieval.

Run ID         nDCG
SCAI09TSPM     0.357
SCAI09TSMAN    0.493
SCAI09TSNP     0.446

Table 4: Results of the Technology Survey Task. Evaluations are based on the nDCG score.


4  Results 

4.1  Results  of  the  Technology  Survey 
Task 

For the TS task, the reported results are based on the normalized
Discounted Cumulative Gain (nDCG) score [12]. Table 4 shows the nDCG
scores of all officially submitted runs for this task.

The run SCAI09TSMAN, based on manually formulated queries, achieved
the best nDCG score of 0.493. The automatic run SCAI09TSNP, using only
noun phrases without query expansion, achieved a slightly lower score
of 0.446. Run SCAI09TSPM, using entity recognition and term
frequency-based ranking, performed worst with a score of 0.357. This
indicates the importance of scoring functions such as the one used by
Lucene for ranking the relevance of retrieved documents.
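For readers unfamiliar with the measure, the sketch below computes nDCG in one common formulation, with a logarithmic discount of log2(rank + 1). The official track evaluation may use a different gain scale or discount, so this is an illustration of the principle rather than a reproduction of the reported scores.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(ranked_gains, judged_gains):
    """Normalize the DCG of a ranking by the DCG of the ideal ranking.

    `ranked_gains` holds the relevance gain of each retrieved document in
    rank order; `judged_gains` holds the gains of all judged documents.
    """
    ideal = sorted(judged_gains, reverse=True)[: len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places the most relevant documents first reaches 1.0, and any other ordering of the same documents scores lower.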

4.2  Results  of  the  Prior  Art  Search 
Task 

For the PA task, the reported results are based on the Binary
Preference score (bpref) [1]. Table 5 shows the bpref scores of all
officially submitted runs for this task.
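The bpref measure rewards rankings in which judged relevant documents appear above judged non-relevant ones and ignores unjudged documents. The sketch below follows the formulation commonly used in trec_eval, as far as we understand it; the function signature is an assumption, and details of the official evaluation may differ.

```python
def bpref(ranking, relevant, nonrelevant):
    """Binary preference of a ranked list of document ids.

    `relevant` and `nonrelevant` are sets of judged document ids; documents
    in neither set are unjudged and do not affect the score.
    """
    n_rel = len(relevant)
    if n_rel == 0:
        return 0.0
    denom = min(n_rel, len(nonrelevant))
    nonrel_seen = 0
    total = 0.0
    for doc_id in ranking:
        if doc_id in nonrelevant:
            nonrel_seen += 1
        elif doc_id in relevant:
            if denom == 0:
                total += 1.0  # no judged non-relevant documents exist
            else:
                # Penalize by the share of non-relevant documents ranked above.
                total += 1.0 - min(nonrel_seen, denom) / denom
    return total / n_rel
```

Relevant documents that are not retrieved at all contribute nothing, so the score also reflects recall.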

The token-based full document search with IPC class outperformed
entity-based and noun phrase-based searches. Filtering the results
based on the priority date showed a mild improvement in retrieval
performance compared to unfiltered results. Weighted zone search, which
boosts different subsections of the document, proved promising in
comparison to plain full document search or to description-only or
claims-only search. An interesting observation is that the claims
search, which is broadly employed by patent experts for invalidity or
prior art search, yielded poor results with token-based, noun
phrase-based and entity-based search alike. Weighting/boosting the
entities does not seem to be helpful, but a run in which only noun
phrases were boosted (SCAI09PAf4b) performed slightly better than
weighting the entities. Boosting co-occurring noun phrases
(SCAI09PAf4c) during querying scored lower than boosting all noun
phrases. Nevertheless, the importance of zone weighting and IPC class
for patent retrieval was demonstrated.

4.3  Post-TREC  Results  of  the  Prior 
Art  Search  Task 

The analysis of the PA task results officially submitted to TREC
showed that including zone weighting and the IPC class significantly
improves retrieval performance. Building on this finding, additional
experiments were performed with different combinations of entity types
used for querying. Table 6 shows the bpref scores of the post-TREC
runs. A combination of tokens, noun phrases and entities searched with
zone weighting and IPC class information improved retrieval
performance. Coupling tokens with noun phrases and tokens with entities
also yielded a notable gain. This indicates the value of using entities
and noun phrases for document retrieval as well as of combining them in
different ways.

5  Discussion  and  Conclusion 

In the TS task, the baseline method with manual query formulation and
query expansion performed better than the other runs. However, the
automatic noun phrase recognition combined with Lucene-based retrieval
seems to work considerably well. A better retrieval performance could
be achieved through the use of informative terms and dictionary
expansion. For the PA task, the retrieval performance using different
entity types as well as different search strategies was demonstrated,
along with the importance of zone weighting and of using
meta-information like the IPC class for patent retrieval. Finally, it
was shown that querying with a combination of tokens, noun phrases and
entities performed relatively better than using the different entity
types alone.

Entity types                                     Full Document  Weighted Zones  Description Only  Claims Only  Full Document & IPC Class
Tokens                                           0.3601 t1a     0.3826 t1b      0.3336 t1c        0.2138 t1d   0.3777 t1e
                                                 0.3777 f1a     0.3894 f1b      0.3501 f1c        0.2137 f1d   0.4004 f1e
Noun Phrases                                     0.3355 t2a     0.3314 t2b      0.3405 t2c        0.2048 t2d   0.3775 t2e
                                                 0.3418 f2a     0.3344 f2b      0.3500 f2c        0.1990 f2d   0.3925 f2e
Noun Phrases & Entities                          0.3369 t3a     0.3514 t3b      0.3290 t3c        0.2105 t3d   0.3726 t3e
                                                 0.3380 f3a     0.3536 f3b      0.3367 f3c        0.2035 f3d   0.3811 f3e
Noun Phrases & Entities (Boost Chemicals)        0.3166 t4a     N/A             N/A               N/A          N/A
                                                 0.3181 f4a     N/A             N/A               N/A          N/A
Noun Phrases & Entities (Boost Noun Phrases)     0.3666 t4b     N/A             N/A               N/A          N/A
                                                 0.3734 f4b     N/A             N/A               N/A          N/A
Noun Phrases & Entities (Boost co-occurring NP)  0.3440 t4c     N/A             N/A               N/A          N/A
                                                 0.3485 f4c     N/A             N/A               N/A          N/A

Table 5: Results of the Prior Art Search Task. A ‘t’ or ‘f’ in the run identifier indicates that the
results were unfiltered or filtered, respectively. Only the last three characters of the run identifiers
are given in the table; a complete run identifier looks like ‘SCAI09PAt1a’. The entity types and
document sections used for querying are also given. Evaluations are based on the bpref score.

Entity types              bpref
Tokens & NP & Entities    0.4355
Tokens & NP               0.4302
Tokens & Entities         0.4121

Table 6: Results of the post-TREC Prior Art Search Task. Results are filtered and the evaluations are based
on the bpref score.

There are several ways to improve retrieval performance. Currently,
the breadth of the knowledge sources used is limited. For example, only
the chemicals present in the DrugBank and KEGG databases have been
used. These databases specialize in compounds of biomedical interest
and do not focus on chemicals contained in ink formulations, cement or
fertilizers. Considering the scope of the IPC classes of the documents
provided within the TREC data set, only 50% of the documents belong to
the biomedical domain. Therefore, indexing the terms using broader
resources that cover terminologies beyond the biomedical domain has to
be tested in future approaches. Using an NP chunker [5] that has been
specifically trained on chemistry-based patents is one way to reduce
the noisy noun phrases. Improving the recognition performance of the
entity recognizers on patents can also contribute to better retrieval.
The methods presented here adopt most of their strategies from
conventional document retrieval techniques. However, patent retrieval
is still at an early stage, which underlines the need for methods
specialized for patent text analysis. Our future work will focus on
overcoming the limitations mentioned above and on optimizing our
retrieval system to better adapt to chemistry-based patents.